{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7.3 String Manipulation（字符串处理）\n",
    "\n",
    "python很多内建方法很适合处理string。而且对于更复杂的模式，可以配合使用正则表达式。而pandas则混合了两种方式。\n",
    "\n",
    "# 1 String Object Methods（字符串对象方法）\n",
    "\n",
    "大部分string处理，使用内建的一些方法就足够了。比如，可以用split来分割用逗号区分的字符串："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "val = 'a,b, guido'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['a', 'b', ' guido']"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val.split(',')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "split经常和strip一起搭配使用来去除空格（包括换行符）:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['a', 'b', 'guido']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pieces = [x.strip() for x in val.split(',')]\n",
    "pieces"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以使用+号把::和字符串连起来："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "first, second, third = pieces"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a::b::guido'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first + '::' + second + '::' + third"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "但这种方法并不python，更快的方法是直接用join方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a::b::guido'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'::'.join(pieces)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其他一些方法适合锁定子字符串位置相关的。用in关键字是检测substring最好的方法，当然，index和find也能完成任务："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'guido' in val"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val.index(',')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val.find(':')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意index和find的区别。如果要找的string不存在的话，index会报错。而find会返回-1:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "substring not found",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-11-280f8b2856ce>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mval\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m':'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m: substring not found"
     ]
    }
   ],
   "source": [
    "val.index(':')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "count会返回一个substring出现的次数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val.count(',')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "replace会取代一种出现方式（pattern）。也通常用于删除pattern，传入一个空字符串即可："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a::b:: guido'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val.replace(',', '::')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ab guido'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val.replace(',', '')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里一些内建的string方法：\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/m643y.png)\n",
    "\n",
    "# 2 Regular Expressions（正则表达式）\n",
    "\n",
    "正则表达式能让我们寻找更复杂的pattern。通常称一个表达式为regex，由正则表达语言来代表一个字符串模式。可以使用python内建的re模块来使用。\n",
    "\n",
    "> 关于正则表达式，有很多教学资源，可以自己找几篇来学一些，这里不会介绍太多。\n",
    "\n",
    "re模块有以下三个类别：patther matching（模式匹配）, substitution（替换）, splitting（分割）。通常这三种都是相关的，一个regex用来描述一种pattern，这样会有很多种用法。这里举个例子，假设我们想要根据空格（tabs，spaces，newlines）来分割一个字符串。用于描述一个或多个空格的regex是`\\s+`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "text = \"foo    bar\\t baz  \\tqux\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['foo', 'bar', 'baz', 'qux']"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.split('\\s+', text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当调用`re.split('\\s+', text)`的时候，正则表达式第一次被compile编译，并且split方法会被调用搜索text。我们可以自己编译regex，用re.compile，可以生成一个可以多次使用的regex object："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "regex = re.compile('\\s+')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['foo', 'bar', 'baz', 'qux']"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regex.split(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果想要得到符合regex的所有结果，以一个list结果返回，可以使用findall方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['    ', '\\t ', '  \\t']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regex.findall(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 为了防止\\在正则表达式中的逃逸，推荐使用raw string literal，比如`r'C:\\x'`，而不是使用`'C:\\\\x`\n",
    "\n",
    "使用re.compile创建一个regex object是被强烈推荐的，如果你打算把一个表达式用于很多string上的话，这样可以节省CPU的资源。\n",
    "\n",
    "match和search，与findall关系紧密。不过findall会返回所有匹配的结果，而search只会返回第一次匹配的结果。更严格地说，match只匹配string开始的部分。这里举个例子说明，我们想要找到所有的邮件地址："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "text = \"\"\"Dave dave@google.com \n",
    "          Steve steve@gmail.com \n",
    "          Rob rob@gmail.com \n",
    "          Ryan ryan@yahoo.com \"\"\"\n",
    "\n",
    "pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# re.IGNORECASE makes the regex case-insensitive \n",
    "regex = re.compile(pattern, flags=re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用findall找到一组邮件地址："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regex.findall(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "search返回text中的第一个匹配结果。match object能告诉我们找到的结果在text中开始和结束的位置："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "m = regex.search(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'dave@google.com'"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text[m.start():m.end()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "regex.match返回None，因为它只会在pattern存在于stirng开头的情况下才会返回匹配结果："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "print(regex.match(text))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "而sub返回一个新的string，把pattern出现的地方替换为我们指定的string："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dave REDACTED \n",
      "          Steve REDACTED \n",
      "          Rob REDACTED \n",
      "          Ryan REDACTED \n"
     ]
    }
   ],
   "source": [
    "print(regex.sub('REDACTED', text))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "假设你想要找到邮件地址，同时，想要把邮件地址分为三个部分，username, domain name, and domain suffix.（用户名，域名，域名后缀）。需要给每一个pattern加一个括号："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "regex = re.compile(pattern, flags=re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "match object会返回一个tuple，包含多个pattern组份，通过groups方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "m = regex.match('wesm@bright.net')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('wesm', 'bright', 'net')"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.groups()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "findall会返回a list of tuples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('dave', 'google', 'com'),\n",
       " ('steve', 'gmail', 'com'),\n",
       " ('rob', 'gmail', 'com'),\n",
       " ('ryan', 'yahoo', 'com')]"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regex.findall(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "sub也能访问groups的结果，不过要使用特殊符号 \\1, \\2。\\1表示第一个匹配的group，\\2表示第二个匹配的group，以此类推："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dave Username: dave, Domain: google, Suffix: com \n",
      "          Steve Username: steve, Domain: gmail, Suffix: com \n",
      "          Rob Username: rob, Domain: gmail, Suffix: com \n",
      "          Ryan Username: ryan, Domain: yahoo, Suffix: com \n"
     ]
    }
   ],
   "source": [
    "print(regex.sub(r'Username: \\1, Domain: \\2, Suffix: \\3', text))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里给一些正则表达式的方法：\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/mj4vc.png)\n",
    "\n",
    "# 3 Vectorized String Functions in pandas（pandas中的字符串向量化函数）\n",
    "\n",
    "一些复杂的数据清理中，string会有缺失值："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', \n",
    "        'Rob': 'rob@gmail.com', 'Wes': np.nan}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave     dave@google.com\n",
       "Rob        rob@gmail.com\n",
       "Steve    steve@gmail.com\n",
       "Wes                  NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.Series(data)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave     False\n",
       "Rob      False\n",
       "Steve    False\n",
       "Wes       True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以把一些字符串方法和正则表达式（用lambda或其他函数）用于每一个value上，通过data.map，但是这样会得到NA(null)值。为了解决这个问题，series有一些数组导向的方法可以用于字符串操作，来跳过NA值。这些方法可以通过series的str属性；比如，我们想检查每个电子邮箱地址是否有'gmail' with str.contains:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pandas.core.strings.StringMethods at 0x111f305c0>"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.str"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave     False\n",
       "Rob       True\n",
       "Steve     True\n",
       "Wes        NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.str.contains('gmail')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "正则表达式也可以用，配合任意的re选项，比如IGNORECASE："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\\\.([A-Z]{2,4})'"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave     [(dave, google, com)]\n",
       "Rob        [(rob, gmail, com)]\n",
       "Steve    [(steve, gmail, com)]\n",
       "Wes                        NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.str.findall(pattern, flags=re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有很多方法用于向量化。比如str.get或index索引到str属性："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.\n",
      "  if __name__ == '__main__':\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Dave     (dave, google, com)\n",
       "Rob        (rob, gmail, com)\n",
       "Steve    (steve, gmail, com)\n",
       "Wes                      NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "matches = data.str.match(pattern, flags=re.IGNORECASE)\n",
    "matches"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了访问嵌套list里的元素，我们可以传入一个index给函数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave     google\n",
       "Rob       gmail\n",
       "Steve     gmail\n",
       "Wes         NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "matches.str.get(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave      dave\n",
       "Rob        rob\n",
       "Steve    steve\n",
       "Wes        NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "matches.str.get(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "也可以使用这个语法进行切片："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave     dave@\n",
       "Rob      rob@g\n",
       "Steve    steve\n",
       "Wes        NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.str[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里有一些字符串向量化的方法：\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/owc7z.png)\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/cn2y0.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [py35]",
   "language": "python",
   "name": "Python [py35]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
